Skip to content

[Enhancement](tvf) Table value function support reading local file#17404

Merged
morningman merged 1 commit into
apache:masterfrom
gitccl:feat_17308
Aug 10, 2023
Merged

[Enhancement](tvf) Table value function support reading local file#17404
morningman merged 1 commit into
apache:masterfrom
gitccl:feat_17308

Conversation

@gitccl

@gitccl gitccl commented Mar 3, 2023

Copy link
Copy Markdown
Contributor

Proposed changes

Issue Number: close #17308

Problem summary

I tested the local tvf with tpch queries. First, generate lineitem datasets with 6001215 rows, and load it into lineitem table by:

insert into lineitem select c11, c1, c4, c2, c3, c5, c6, c7, c8, c9, c10, c12, c13, c14, c15, c16 
from local(
        "file_path" = "tools/tpch-tools/bin/tpch-data/lineitem.tbl.1", 
        "backend_id" = "10003", 
        "format" = "csv", 
        "column_separator" = "|"
);

Then, run q1 and q16 tpch queries, the query result is correct.

It can also analyze the BE's log directly like:

mysql> select * from local(
        "file_path" = "log/be.out",
        "backend_id" = "10006",
        "format" = "csv")
       where c1 like "%start_time%" limit 10;
+--------------------------------------------------------+
| c1                                                     |
+--------------------------------------------------------+
| start time: 2023年 08月 07日 星期一 23:20:32 CST       |
| start time: 2023年 08月 07日 星期一 23:32:10 CST       |
| start time: 2023年 08月 08日 星期二 00:20:50 CST       |
| start time: 2023年 08月 08日 星期二 00:29:15 CST       |
+--------------------------------------------------------+

Checklist(Required)

  • Does it affect the original behavior
  • Has unit tests been added
  • Has document been added or modified
  • Does it need to update dependencies
  • Is this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions Bot added the area/planner Issues or PRs related to the query planner label Mar 3, 2023
@github-actions

github-actions Bot commented Mar 7, 2023

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman morningman self-assigned this Mar 24, 2023
@gitccl gitccl marked this pull request as ready for review May 21, 2023 03:26
@github-actions github-actions Bot added the kind/docs Categorizes issue or PR as related to documentation. label May 21, 2023
@gitccl

gitccl commented May 21, 2023

Copy link
Copy Markdown
Contributor Author

run buildall

@github-actions

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

2 similar comments
@github-actions

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@gitccl

gitccl commented May 21, 2023

Copy link
Copy Markdown
Contributor Author

run buildall

@gitccl gitccl requested a review from morningman May 24, 2023 01:28
Comment thread fe/fe-core/src/main/java/org/apache/doris/planner/external/TVFScanNode.java Outdated
Comment thread fe/fe-core/src/main/java/org/apache/doris/planner/external/TVFScanNode.java Outdated
@github-actions

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

morningman
morningman previously approved these changes May 28, 2023

@morningman morningman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@morningman

Copy link
Copy Markdown
Contributor

run buildall

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 28, 2023
@github-actions

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@github-actions github-actions Bot removed the approved Indicates a PR has been approved by one committer. label May 28, 2023
@gitccl

gitccl commented May 28, 2023

Copy link
Copy Markdown
Contributor Author

run buildall

morningman
morningman previously approved these changes May 28, 2023

@morningman morningman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 28, 2023
@github-actions

github-actions Bot commented Aug 8, 2023

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen

Copy link
Copy Markdown
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.34 seconds
stream load tsv: 516 seconds loaded 74807831229 Bytes, about 138 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162135191 Bytes

morningman
morningman previously approved these changes Aug 8, 2023

@morningman morningman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Aug 8, 2023
@github-actions

github-actions Bot commented Aug 8, 2023

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

glob_t glob_result;
memset(&glob_result, 0, sizeof(glob_result));

int rc = glob(pattern.c_str(), GLOB_TILDE, NULL, &glob_result);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int rc = glob(pattern.c_str(), GLOB_TILDE, NULL, &glob_result);
int rc = glob(pattern.c_str(), GLOB_TILDE, nullptr, &glob_result);

Comment on lines +83 to 86

protected String getFsName(FileSplit split) {
return tableValuedFunction.getFsName();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected String getFsName(FileSplit split) {
return tableValuedFunction.getFsName();
}
@Override
protected String getFsName(FileSplit split) {
return tableValuedFunction.getFsName();
}

@morningman

Copy link
Copy Markdown
Contributor

run buildall

@github-actions github-actions Bot removed the approved Indicates a PR has been approved by one committer. label Aug 9, 2023
@github-actions

github-actions Bot commented Aug 9, 2023

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions

github-actions Bot commented Aug 9, 2023

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen

Copy link
Copy Markdown
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.16 seconds
stream load tsv: 510 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17162481757 Bytes

@morningman

Copy link
Copy Markdown
Contributor

run buildall

@github-actions

github-actions Bot commented Aug 9, 2023

Copy link
Copy Markdown
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen

Copy link
Copy Markdown
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.58 seconds
stream load tsv: 510 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162138241 Bytes

@morningman morningman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Aug 10, 2023
@github-actions

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@BePPPower BePPPower left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@morningman morningman merged commit 71807ce into apache:master Aug 10, 2023
heguanhui pushed a commit to heguanhui/doris that referenced this pull request Jun 7, 2026
…ble with non-ASAN builds

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: LocalFileSystemTest.TestGlob hardcodes the path
`./be/ut_build_ASAN/test/file_path/` for creating test files, but
`safe_glob()` looks up files under `config::user_files_secure_path`
(which resolves to `${DORIS_HOME}` = `${CMAKE_BUILD_DIR}/test/`).
In ASAN mode these paths happen to coincide, so the test passes. In
LSAN mode ${CMAKE_BUILD_DIR} is `be/ut_build_LSAN` while files are
still created under `be/ut_build_ASAN`, causing the glob to find
nothing and the test to fail.

This defect was introduced in commit 71807ce (PR apache#17404) which
added the test with an ASAN-specific hardcoded path.

### Release note

None

### Check List (For Author)

- Test: Unit Test
- Behavior changed: No
- Does this need documentation: No
heguanhui pushed a commit to heguanhui/doris that referenced this pull request Jun 8, 2026
…ble with non-ASAN builds

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: LocalFileSystemTest.TestGlob hardcodes the path
`./be/ut_build_ASAN/test/file_path/` for creating test files, but
`safe_glob()` looks up files under `config::user_files_secure_path`
(which resolves to `${DORIS_HOME}` = `${CMAKE_BUILD_DIR}/test/`).
In ASAN mode these paths happen to coincide, so the test passes. In
LSAN mode ${CMAKE_BUILD_DIR} is `be/ut_build_LSAN` while files are
still created under `be/ut_build_ASAN`, causing the glob to find
nothing and the test to fail.

This defect was introduced in commit 71807ce (PR apache#17404) which
added the test with an ASAN-specific hardcoded path.

### Release note

None

### Check List (For Author)

- Test: Unit Test
- Behavior changed: No
- Does this need documentation: No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. area/planner Issues or PRs related to the query planner area/vectorization dev/2.0.1-merged kind/docs Categorizes issue or PR as related to documentation. merge_conflict reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Table value function support reading local file

5 participants